Frontiers in Artificial Intelligence
Frontiers Media SA
Preprints posted in the last 30 days, ranked by how well they match Frontiers in Artificial Intelligence's content profile, based on 11 papers previously published here. The average preprint has a 0.08% match score for this journal, so anything above that is already an above-average fit.
Roesler, M. W.; Wells, C.; Schamberg, G.; Gao, J.; Harrison, E.; O'Grady, G.; Varghese, C.
Background: Predictive models employing machine learning algorithms are increasingly being used in clinical decision making, and improperly calibrated models can result in systematic harm. We sought to investigate the impact of class imbalance correction, a commonly applied preprocessing step in machine learning model development, on calibration and modelled clinical decision making in a large real-world context. Methods: A histogram-based gradient boosting classifier was trained on a highly imbalanced national dataset of >1.8 million patients undergoing surgery, to predict the risk of 90-day mortality and complications after surgery. Class imbalance correction strategies including random oversampling (ROS), synthetic minority oversampling technique (SMOTE), random under-sampling (RUS), and cost-sensitive learning (CSL) were compared to the natural distribution (natural). Models were tested and compared with classification metrics, calibration plots, decision curve analysis, and simulated clinical impact analysis. Results: The natural model demonstrated high performance (AUROC 0.94, 95% CI 0.94-0.95 for mortality; 0.84, 95% CI 0.84-0.85 for complications) and calibration (log loss 0.05, 95% CI 0.04-0.05 for mortality; 0.23, 95% CI 0.23-0.24 for complications). Class imbalance mitigation (CSL, ROS, RUS, and SMOTE) did not improve AUROC or AUPRC but increased recall and F1 scores at the expense of precision and accuracy. However, these methods severely compromised model calibration, leading to significant over-prediction of risks (up to a 62.8% increase), as further evidenced by increased log loss across all mitigation techniques. Decision curve analysis and clinical scenario testing confirmed that the natural model provided the highest net benefit. Conclusion: Class imbalance correction methods result in significant miscalibration, leading to possible harm when used for clinical decision making.
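For readers unfamiliar with the effect this abstract describes, the sketch below (synthetic data with scikit-learn and imbalanced-learn, not the authors' code or cohort) shows how training after random oversampling typically inflates predicted risks and worsens log loss relative to training on the natural distribution, even when AUROC barely changes.

```python
# Minimal sketch (synthetic data, not the study's cohort): comparing a model trained
# on the natural class distribution against one trained after random oversampling,
# checking discrimination (AUROC) and calibration (log loss, mean predicted risk).
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Highly imbalanced stand-in data (~3% positive class).
X, y = make_classification(n_samples=50_000, n_features=20, weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

natural = HistGradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)
oversampled = HistGradientBoostingClassifier(random_state=0).fit(X_ros, y_ros)

for name, model in [("natural", natural), ("oversampled", oversampled)]:
    p = model.predict_proba(X_te)[:, 1]
    print(f"{name:>11}: AUROC={roc_auc_score(y_te, p):.3f}  log loss={log_loss(y_te, p):.3f}  "
          f"mean predicted risk={p.mean():.4f}  observed rate={y_te.mean():.4f}")
```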
Ray, P.
Thyroid carcinoma is one of the most prevalent endocrine malignancies worldwide, and accurate preoperative differentiation between benign and malignant thyroid nodules remains clinically challenging. Current diagnostic practice relies on the practitioner's judgment when weighing imaging findings against separate clinical tests, which introduces inconsistency and can lead to incorrect evaluations. Combining radiological imaging with clinical information systems can make outcome prediction more reliable and support clinical decision-making. This study introduces a multimodal deep learning framework that combines magnetic resonance imaging (MRI) data with clinical text to predict thyroid cancer. A Vision Transformer (ViT) extracts high-level features from MRI scans, while a domain-adapted language model processes clinical documents containing patient medical history, symptoms, and laboratory results. A cross-modal attention mechanism fuses the imaging and textual representations, capturing how the two types of data are interconnected, and a classification layer maps the fused features to a probability of malignancy. Experimental results show that the proposed multimodal system outperforms unimodal baselines in accuracy, sensitivity, specificity, and AUC, helping medical personnel make better preoperative decisions.
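A minimal sketch of the fusion idea described above, assuming PyTorch and stand-in feature tensors; the layer sizes, mean pooling, and single attention direction are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch (assumptions, not the paper's code): fusing ViT image features with
# clinical-text features via cross-modal attention, then classifying the fused tokens.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, n_heads=8, n_classes=2):
        super().__init__()
        # Text tokens attend to image patch tokens (the reverse direction could be added too).
        self.text_to_image = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, n_classes))

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, N_img, dim) from a ViT; text_tokens: (B, N_txt, dim) from a language model.
        fused, _ = self.text_to_image(query=text_tokens, key=image_tokens, value=image_tokens)
        pooled = fused.mean(dim=1)        # simple mean pooling over fused tokens
        return self.classifier(pooled)    # logits for benign vs malignant

# Usage with random stand-in features.
model = CrossModalFusion()
logits = model(torch.randn(4, 196, 768), torch.randn(4, 64, 768))
print(logits.shape)  # torch.Size([4, 2])
```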
Pham, T. D.
Objective: This study investigates whether incorporating physiological coupling concepts into neural network design can support stable and interpretable feature learning for histopathological image classification under limited data conditions. Methods: A physiologically inspired architecture, termed CardioPulmoNet, is introduced to model interacting feature streams analogous to pulmonary ventilation and cardiac perfusion. Local and global tissue features are integrated through bidirectional multi-head attention, while a homeostatic regularization term encourages balanced information exchange between streams. The model was evaluated on three histopathological datasets involving oral squamous cell carcinoma, oral submucous fibrosis, and heart failure. In addition to end-to-end training, learned representations were assessed using linear support vector machines to examine feature separability. Results: CardioPulmoNet achieved performance comparable to several pretrained convolutional neural networks across the evaluated datasets. When combined with a linear classifier, improved classification performance and higher area under the receiver operating characteristic curve were observed, suggesting that the learned feature embeddings are well structured for downstream discrimination. Conclusion: These results indicate that physiologically motivated architectural constraints may contribute to stable and discriminative representation learning in computational pathology, particularly when training data are limited. The proposed framework provides a step toward integrating physiological modeling principles into medical image analysis and may support future development of transferable and interpretable learning systems for histopathological diagnosis.
Pradhan, A. M.; Shetty, V. A.; Gregor, C.; Graham, J. H.; Tusing, L.; Hirsch, A. G.; Hall, E.; Troiani, V.; Davis, M. P.; Bieler, D. L.; Romagnoli, K. M.; Kraus, C. K.; Piper, B. J.; Wright, E. A.
Introduction: Recreational and medical cannabis use (CU) information is often available within the electronic health record (EHR) in a format that is impractical for health care provider use. Transformation of free-text EHR documentation in notes to discrete elements is possible using natural language processing (NLP) and has the potential to characterize CU efficiently. The objective of this study was to develop an NLP algorithm to identify documentation of CU within EHR unstructured clinical notes. Methods: We identified EHR notes with cannabis-related terminologies through a keyword search among all Geisinger patients with at least one encounter between 1/1/2013 and 6/30/2022. We trained four NLP models to classify notes into six categories based on time, context, and reliability of CU documentation identified through manual annotation. We compared the demographic characteristics of patients with positive classification for CU using the best-performing model to those of the overall population. Results: Of the over 1.7 million eligible patients, 150,726 (8.6%) were flagged as cannabis users. Bio-ClinicalBERT, a transformer-based NLP model, achieved close to human performance in classifying CU (weighted precision=91.4, recall=93.3, F-score=92.4). Cannabis users had higher BMI and were at least nine-fold more likely to use tobacco, alcohol, and illicit substances. Conclusion: Our study evaluated the prevalence of CU documentation across the entire corpus of EHR notes data without population segmentation. The NLP methodologies used achieved performance close to that of human annotation and laid the foundation for identifying and classifying CU within unstructured data sources, with future applications in research and patient care. Plain Language Summary: Marijuana, also known as cannabis, may impact the health of patients, yet it is not routinely captured in medical records, and when documented, it is often found in unstructured formats (e.g., progress notes) rather than in discrete fields. Incomplete and unstructured capture limits many functional capabilities within the EHR that enhance patient care (e.g., drug interactions, notifications) and limits researchers' ability to identify patients routinely exposed to marijuana use. The transformation of free-text documentation of cannabis use (CU) into discrete elements can be performed using natural language processing (NLP). The objective of this study was to develop an NLP model to identify CU in unstructured clinical notes in the EHR. We examined the EHRs of Geisinger patients in Pennsylvania over a 10-year period. Among 1.7 million patients, 9% were identified as cannabis users. Of the NLP models tested, Bio-ClinicalBERT achieved the highest performance. Cannabis users had a higher BMI and were ten-fold more likely to be tobacco users, ten-fold more likely to use alcohol, and nine-fold more likely to use illicit substances. NLP can be used to better understand the risks and benefits of CU at a population level and may improve patient identification to assist clinical decision-making. Future CU epidemiological research should continue to explore other avenues to automate and improve CU documentation by leveraging rapidly evolving technologies, such as artificial intelligence-driven tools.
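A minimal fine-tuning sketch of the kind of note classifier described above, assuming the public emilyalsentzer/Bio_ClinicalBERT checkpoint and the Hugging Face transformers Trainer; the six-label schema and the single training example are placeholders, not the study's annotated data.

```python
# Minimal sketch (placeholder data, not the study's pipeline): fine-tuning
# Bio-ClinicalBERT to classify clinical notes into cannabis-use categories.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

MODEL = "emilyalsentzer/Bio_ClinicalBERT"  # public checkpoint; label ids 0-5 are illustrative
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=6)

class NoteDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Hypothetical annotated note and category id.
train_ds = NoteDataset(["Patient reports current marijuana use for chronic pain."], [1])
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cu_nlp", num_train_epochs=1, per_device_train_batch_size=8),
    train_dataset=train_ds,
)
trainer.train()
```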
Kumar, S. N.; K S, G.; Chinnakanu, S. J.; Krishnan, H.; M, N.; Subramaniam, S.
Non-alcoholic fatty liver disease (NAFLD) is a globally prevalent hepatic condition caused by the buildup of fat in the liver. It is frequently associated with metabolic comorbidities such as hypertension, cardiovascular disease (CVD), and prediabetes. However, early detection remains challenging due to its asymptomatic progression, and existing primary diagnostic methods, such as imaging or liver biopsy, are often expensive and inaccessible in rural areas. This study proposes a two-stage, interpretable machine learning pipeline for the non-invasive and cost-effective prediction of NAFLD and its key comorbidities using routine clinical parameters. The NAFLD prediction model was developed using the XGBoost algorithm, trained on a hybrid dataset that combines real patient data with rule-based synthetic data generated by simulating clinically plausible cases. Upon a NAFLD-positive prediction, three separate XGBoost models, trained on data labelled using clinical thresholds, assess individual risks for hypertension, cardiovascular disease, and prediabetes. Explainability is provided by SHAP (SHapley Additive exPlanations), which gives insight into feature relevance, while biomarker radar plots aid the visual interpretation of comorbidities. A user-friendly Streamlit interface enables real-time interaction with the tool for potential clinical application. The NAFLD model demonstrated robust performance, while the models used for predicting comorbidities achieved perfect performance, which may reflect the limited dataset size used in the second stage. This work underscores the potential of AI-driven tools in NAFLD diagnosis, particularly when combined with explainable AI methods.
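A minimal sketch of a two-stage cascade with SHAP explanations, assuming xgboost and shap with synthetic stand-in data; the hypothetical hypertension label and model settings are illustrative, not the paper's dataset or tuning.

```python
# Minimal sketch (synthetic data, not the authors' pipeline): a two-stage XGBoost
# cascade with SHAP explanations for the first-stage NAFLD prediction.
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Stand-in routine clinical parameters and labels.
X, y_nafld = make_classification(n_samples=2000, n_features=12, random_state=0)
y_htn = (X[:, 0] + np.random.default_rng(0).normal(size=2000) > 0).astype(int)  # hypothetical comorbidity label

# Stage 1: NAFLD vs no NAFLD.
stage1 = xgb.XGBClassifier(n_estimators=200, max_depth=4).fit(X, y_nafld)

# Stage 2: comorbidity risk model applied only to NAFLD-positive cases.
pos = stage1.predict(X) == 1
stage2_htn = xgb.XGBClassifier(n_estimators=200, max_depth=4).fit(X[pos], y_htn[pos])

# SHAP values explain which routine parameters drive the stage-1 prediction.
explainer = shap.TreeExplainer(stage1)
shap_values = explainer.shap_values(X[:5])
print(np.round(shap_values[0], 3))
```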
Pemmasani, S. K.; Athmakuri, S.; R G, S.; Acharya, A.
A neurological health score (NHS), indicating the health of the brain and nervous system, helps identify high-risk individuals and guide lifestyle modification recommendations. In the present study, we developed an NHS based on genetic, lifestyle, and biochemical variables associated with eight neurological disorders: dementia, stroke, Parkinson's disease, amyotrophic lateral sclerosis, schizophrenia, bipolar disorder, multiple sclerosis, and migraine. UK Biobank data from Caucasian individuals were used to develop the model, and data from individuals of Indian ethnicity were used to validate it. Logistic regression and XGBoost algorithms were used to select the significant variables for the disorders. The NHS developed from the selected variables remained highly significant after adjusting for age and sex (AUC: 0.6; OR: 0.95). A higher NHS was associated with a lower risk of neurological disorders and better social well-being. The highest NHS group (top 25%) showed 1.3 times lower risk compared with the remaining individuals. The results of this study help in developing a framework for quantifying neurological health in a clinical setting.
Liu, R.; Azzam, M.; Zabik, N.; Wan, S.; Blackford, J.; Wang, J.
In 2024, approximately 30% of U.S. adolescents reported having consumed alcohol at least once in their lifetime, with about 25% of these individuals engaging in binge drinking. Adolescent alcohol use is associated with neurodevelopmental impairments, elevated risk of later alcohol use, and mental health disorders. These findings underscore the importance of identifying the variables driving adolescent alcohol use and leveraging them for early identification and targeted intervention. Previous studies have typically developed machine-learning classification models that use neuroimaging data in combination with limited clinical measurements. Neuroimaging data are expensive and difficult to obtain at scale, whereas clinical measures are more practical for large-scale screening due to their low cost and widespread accessibility. However, clinical-only approaches for alcohol drinking classification remain largely underexplored. Furthermore, prior studies have often focused on adults, limiting generalizability to the broader adolescent population. Additionally, confounding factors such as age and substance use, which are strongly correlated with alcohol consumption, have often been inadequately addressed, potentially inflating classification performance. Finally, class imbalance remains a persistent challenge, with prior attempts yielding only limited improvements. To address these limitations, we propose FocalTab, a framework that integrates TabPFN with focal loss for robust generalization and effective mitigation of class imbalance. The approach also incorporates an initial preprocessing step that removes confounding factors such as age and substance use. We compare FocalTab against state-of-the-art methods across different variable selections and dataset settings. FocalTab achieves the highest accuracy (84.3%) and specificity (80.0%) in the most stringent setting, in which both age and substance use variables were excluded, whereas competing models drop to near-chance specificity (12-24%). We further applied SHapley Additive exPlanations (SHAP) analysis to identify key clinical predictors of drinker classification, supporting enhanced screening and early intervention.
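The sketch below illustrates two ingredients named in the abstract, focal loss and confound removal, under stated assumptions (NumPy and scikit-learn with synthetic data); how FocalTab combines these with TabPFN is not reproduced here.

```python
# Minimal sketch (synthetic data; the paper's FocalTab integration is not shown):
# binary focal loss for imbalanced labels, and regressing confounds (e.g., age)
# out of clinical features before classification.
import numpy as np
from sklearn.linear_model import LinearRegression

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss; down-weights easy, well-classified examples."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)
    w = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-w * (1 - pt) ** gamma * np.log(pt))

def residualize(X, confounds):
    """Remove variance linearly explained by confounds from each feature column."""
    return X - LinearRegression().fit(confounds, X).predict(confounds)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                 # stand-in clinical features
age = rng.uniform(12, 18, size=(500, 1))       # confound to remove
X_adj = residualize(X, age)                    # confound-adjusted features
print(focal_loss(rng.uniform(size=500), rng.integers(0, 2, size=500)))
```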
Vanegas Mueller, E.; Harford, M.; He, L.; Banerjee, A.; Leeson, P.; Villarroel, M.
Sudden cardiac death risk is 2-3-fold higher in athletes than in non-athletes. We classify sports-related cardiac arrhythmias using a novel explainability framework comprising data analysis, model interpretability, post-hoc visualisation, and systematic assessment. Two neural networks, one with interpretable sinc convolution and one with standard convolution, were trained on general-population ECGs (PhysioNet, n=88,253, 30 arrhythmias, three continents) and tested on professional footballers (PF12RED, n=161) via domain adaptation for normal sinus rhythm (NSR), sinus bradycardia (SB), incomplete right bundle branch block (IRBBB), and T-wave inversion (TWI). Sinc convolution achieved superior NSR detection (AUROC 0.75 vs 0.70), whilst standard convolution excelled at SB (0.74 vs 0.73), IRBBB (0.66 vs 0.58), and TWI (0.59 vs 0.54). Gradient-weighted Class Activation Mapping revealed that sinc models focus on physiologically relevant ECG segments (the PR interval for NSR/SB and the T wave for TWI). We hypothesise that sinc convolution better captures periodic rhythms but struggles with complex morphological patterns, suggesting architectural choice should align with underlying cardiac pathophysiology. Graphical abstract. Abbreviations: AI, artificial intelligence; AUPRC, area under the precision-recall curve; AUROC, area under the receiver operating characteristic curve; Conv, convolution; ECG, electrocardiogram; Grad-CAM, gradient-weighted class activation mapping; IAVB, first-degree atrioventricular block; IRBBB, incomplete right bundle branch block; LAD, left axis deviation; LBBB, left bundle branch block; LVH, left ventricular hypertrophy; NSR, normal sinus rhythm; QT, QT interval; RAD, right axis deviation; RBBB, right bundle branch block; RVH, right ventricular hypertrophy; SA, sinus arrhythmia; SB, sinus bradycardia; TWI, T-wave inversion; xAI, explainable artificial intelligence.
Perez Claudio, E.; Horvat, C.; Au, A. K.; Clark, R. S. B.; Taylor, M. W.; Cooper, G. F.; Li, R.; Nourelahi, M.; Hochheiser, H.
Machine learning adoption in clinical decision support systems remains limited by concerns about transparency and robustness. Causal structure learning (CSL) combined with expert knowledge may address these concerns by identifying potentially causal predictors, enabling more interpretable and clinically aligned models. In this study, we show that by integrating clinician expertise with CSL algorithms we can identify plausible causal drivers of acquired acute brain dysfunction (ABD) in the pediatric intensive care unit (PICU), which enables the development of parsimonious predictive models without substantial loss in performance. To do so, we analyzed 18,568 PICU encounters from the University of Pittsburgh Medical Center Children's Hospital (2010-2022) and elicited knowledge from experienced clinicians. Encounters with acquired ABD were defined using the validated ABD computable phenotype. Expert knowledge was elicited from four clinicians through iterative interviews to construct a consensus directed acyclic graph (DAG). Clinician consensus achieved acceptable inter-rater reliability (Fleiss' kappa = 0.62) after two rounds of interviews and identified 16 biomarkers as potential causes of acquired ABD. Two CSL algorithms, GOLEM and PC-MB, were applied to enrich the clinicians' consensus DAG. The PC-MB algorithm showed 78% concordance with expert consensus, while GOLEM showed 46%. Together, the CSL algorithms identified seven biomarkers as potential causes that were not included in the clinicians' DAG: blood urea nitrogen, creatinine, dobutamine, glucose, potassium, PTT, and SpO2. Using multiple variations of the enriched DAGs, XGBoost models were trained using biomarkers identified as potential causes of acquired ABD; these were evaluated primarily by area under the precision-recall curve (AUPRC). Models trained on the intersection of the clinician consensus and PC-MB DAGs achieved an AUPRC of 0.79 (95% CI: 0.75-0.82) using only 14 biomarkers, compared to 0.81 (95% CI: 0.78-0.84) for the control model using all 45 biomarkers. When restricted to vitals and laboratory results alone, the best-performing model achieved an AUPRC of 0.77. Combining clinical expertise with causal structure learning enables the identification of causal hypotheses consistent with the clinical understanding of the participating clinicians and the development of parsimonious predictive models for acquired ABD in the PICU.
Hoe, Z. Y.; Ding, R.-S.; Chou, C.-P.; Hu, C.; Lee, C.-H.; Tzeng, Y.-D.; Pan, C.-T.; Lee, M.-C.; Lee, E. K.-L.
Background: Breast cancer-related lymphedema (BCRL) is a common complication following breast cancer treatment. While lymphoscintigraphy is considered the diagnostic gold standard, it is unsuitable for routine periodic monitoring or assessment of treatment efficacy. Shear wave elastography (SWE) offers a possible alternative, but traditional modes of operation limit its potential. Proposed Solutions: The Holder-Optimized Elastography (HOE) method is introduced to eliminate pressure issues introduced by manual operation of ultrasound probes by stabilizing them above the cutis. Methods: The HOE method was used to acquire ARFI images of high-velocity areas (HVAs, with shear wave velocity greater than 7 m/s) in limbs with and without BCRL (as confirmed and characterized by lymphoscintigraphy) in two cohorts of 15 and 125 patients. Results: The HOE method enabled ARFI elastography to directly and consistently visualize the effects caused by both obstructed lymphatic vessels and intraluminal lymphatic fluid as HVAs, whereas traditional hand-held methods did not. Inter-limb differences in HVA burden showed moderate diagnostic performance for detecting BCRL and grading obstruction, with modest sensitivity. However, there was systematic underestimation of both early and confluent advanced lesions. Conclusion: HOE-based HVA imaging has potential for rapid and non-invasive monitoring of lymphedema course and treatment response and may serve as a useful adjunct to existing diagnostic tools for BCRL. However, further technical refinements and quantitative analytic methods will be required to fully exploit the richer SWV information provided by HOE and to enhance the diagnostic utility of HVAs. Summary Statement: The Holder-Optimized Elastography (HOE) method increases the diagnostic capability of ARFI elastography for breast cancer-related lymphedema, allowing for the non-invasive detection of some, but not all, lymphatic obstructions. Key Results: The HOE method revealed the effects caused by fluid-filled lymphatic vessels as "high-velocity areas" (HVAs), which are difficult to detect by conventional methods. HVA counts for detecting lymphedema (any obstruction vs. no obstruction) showed high specificity (0.86-1.00) but low sensitivity (0.57-0.67). Conversely, HVA counts for staging lymphedema (i.e., total vs. partial obstruction) showed high sensitivity (up to 1.00) but low specificity (0.48-0.66). The inter-limb difference of HVAs counted in whole-limb scans between affected and unaffected limbs (the "Global Mean Difference") provided the most balanced diagnostic performance (sensitivity 0.67-0.79; specificity 0.88-0.89).
Castelo, A.; O'Connor, C.; Gupta, A. C.; Anderson, B. M.; Woodland, M.; Altaie, M.; Koay, E. J.; Odisio, B. C.; Tang, T. T.; Brock, K. K.
Artificial intelligence (AI) based segmentation has many medical applications, but limited curated datasets challenge model training; this study compares the impact of dataset annotation quality and quantity on whole-liver AI segmentation performance. We obtained 3,089 abdominal computed tomography scans with whole-liver contours from MD Anderson Cancer Center (MDA) and a MICCAI challenge. A total of 249 scans were withheld for testing, of which 30 (the MICCAI challenge data) were reserved for external validation. The remaining scans were divided into mixed-curation and highly curated groups, randomly sampled into sub-datasets of various sizes, and used to train 3D nnU-Net segmentation models. Dice similarity coefficients (DSC), surface DSC with 2 mm margins (SD 2mm), the 95th percentile of the Hausdorff distance (HD95), and 2D axial slice DSC (Slice DSC) were used to evaluate model performance. The highly curated 244-scan model (DSC=0.971, SD 2mm=0.958, HD95=2.98 mm) did not differ significantly on 3D evaluation metrics from the mixed-curation 2,840-scan model (DSC=0.971 [p>.999], SD 2mm=0.958 [p>.999], HD95=2.87 mm [p>.999]). The 710-scan mixed-curation model (Slice DSC=0.929) significantly outperformed the highly curated 244-scan model (Slice DSC=0.923 [p=0.012]) on the 30 external scans. Highly curated datasets yielded performance equivalent to datasets a full order of magnitude larger. The benefits of larger, mixed-curation datasets are evident in model generalizability metrics and local improvements. In conclusion, trade-offs between dataset quality and quantity for model training are nuanced and goal dependent.
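For reference, a minimal sketch of the overlap and distance metrics quoted above, assuming NumPy/SciPy and toy binary masks; the HD95 here is computed over all mask voxels rather than extracted surfaces, a simplification for illustration only.

```python
# Minimal sketch (toy 2D masks): Dice similarity coefficient and an illustrative
# 95th-percentile Hausdorff distance between a predicted and a reference mask.
import numpy as np
from scipy.spatial.distance import cdist

def dice(a, b):
    a, b = a.astype(bool), b.astype(bool)
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hd95(points_a, points_b):
    # Symmetric 95th-percentile of nearest-neighbour distances (all voxels, not surfaces).
    d = cdist(points_a, points_b)
    return max(np.percentile(d.min(axis=1), 95), np.percentile(d.min(axis=0), 95))

pred = np.zeros((64, 64), dtype=bool); pred[20:40, 20:40] = True
ref = np.zeros((64, 64), dtype=bool);  ref[22:42, 22:42] = True
print("DSC:", round(dice(pred, ref), 3))
print("HD95 (voxels):", round(hd95(np.argwhere(pred), np.argwhere(ref)), 2))
```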
Singh, D. B.; Dawadi, P. R.; Dangi, Y.
Background: Tuberculosis (TB) remains a major public health challenge in Nepal, with incidence rates substantially higher than global estimates. Accurate forecasting of TB incidence is essential for early warning systems, resource allocation, and targeted interventions. This study aimed to develop and validate a hybrid Seasonal Autoregressive Integrated Moving Average (SARIMA) and Convolutional Neural Network Auto-Regressive (CNNAR) model for TB incidence forecasting in Nepal. Methods: Monthly TB incidence data (January 2015 to December 2024) were obtained from the National Tuberculosis Control Center (NTCC), Nepal. A hybrid SARIMA-CNNAR model was developed, where SARIMA modeled linear seasonal trends and CNNAR captured nonlinear patterns in the residuals. Hyperparameters were optimized using grid search with 5-fold cross-validation. Model performance was evaluated using Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and R2 on the 2024 test set. Structural break analysis and sensitivity analysis assessed model robustness. The hybrid model was compared against standalone SARIMA, CNNAR, and three state-of-the-art benchmarks: Long Short-Term Memory (LSTM), Facebook Prophet, and XGBoost. Results: TB incidence in Nepal increased from a monthly average of 2,048 cases in 2015 to 3,447 in 2024 (a 68.4% increase). The hybrid SARIMA-CNNAR model demonstrated strong performance, with test set metrics of MAE=248.35, RMSE=294.31, MAPE=7.2%, and R2=0.79. Comparative performance: CNNAR (MAE=251.08, RMSE=336.55, MAPE=7.7%, R2=0.73); LSTM (MAE=267.91, RMSE=324.55, MAPE=7.5%, R2=0.75); XGBoost (MAE=314.74, RMSE=373.99, MAPE=8.5%, R2=0.66); Prophet (MAE=371.15, RMSE=478.40, MAPE=10.4%, R2=0.45); SARIMA (MAE=401.11, RMSE=503.93, MAPE=10.99%, R2=0.39). All models captured seasonal peaks in March-May and July-August, with forecasts for 2025 indicating continued seasonal patterns. Sensitivity analysis confirmed robustness, with <5% metric variation across parameter configurations. Conclusions: This first validated hybrid model for TB prediction in Nepal demonstrates high forecasting accuracy by integrating linear seasonal modeling with nonlinear pattern detection. The approach offers a robust tool for evidence-based public health planning in resource-limited settings and is suitable for integration into national surveillance systems. Author Summary: Tuberculosis remains a major public health challenge in Nepal, with cases increasing substantially over the past decade. In this study, we developed a computer model that combines two different forecasting approaches: one that captures regular seasonal patterns and another that learns complex trends from data to predict monthly TB cases. Using ten years of national surveillance data, our hybrid model achieved high accuracy in forecasting TB incidence, outperforming standard approaches including SARIMA, Prophet, CNNAR, LSTM neural networks, and XGBoost. The model successfully predicted seasonal peaks in March-May and July-August, with forecasts for 2025 suggesting continued high case numbers. These predictions can help Nepal's health authorities prepare by pre-positioning diagnostic supplies, scheduling additional staff during peak months, and targeting awareness campaigns. The modeling approach is designed to be adaptable to other diseases and countries with similar health data.
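A minimal sketch of the hybrid idea (a linear seasonal model plus a nonlinear model on its residuals), assuming statsmodels' SARIMAX and a gradient-boosted regressor standing in for the CNNAR component; the synthetic monthly series and model orders are assumptions, not the NTCC data or the authors' configuration.

```python
# Minimal sketch (synthetic series): SARIMA for the seasonal/linear part, with a
# nonlinear model trained on lagged SARIMA residuals to correct the forecast.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
t = np.arange(120)  # 120 hypothetical monthly case counts
cases = 2000 + 12 * t + 300 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 80, 120)
y = pd.Series(cases, index=pd.date_range("2015-01-01", periods=120, freq="MS"))

train, test = y[:-12], y[-12:]
sarima = SARIMAX(train, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)

# Nonlinear model on 12 lagged residuals (stand-in for the CNNAR residual learner).
resid = sarima.resid
lags = np.column_stack([resid.shift(k).to_numpy() for k in range(1, 13)])[12:]
resid_model = GradientBoostingRegressor().fit(lags, resid.to_numpy()[12:])

# Hybrid one-step forecast = SARIMA forecast + predicted residual correction.
sarima_fc = sarima.forecast(steps=12)
last_lags = resid.to_numpy()[-12:][::-1].reshape(1, -1)  # most recent residual first
hybrid_first = sarima_fc.iloc[0] + resid_model.predict(last_lags)[0]
print(round(hybrid_first, 1), "vs actual", round(test.iloc[0], 1))
```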
Alkeyeva, R.; Nagiyev, I.; Kim, D.; Nurmanova, B.; Omarova, Z.; Varol, H. A.; Chan, M.-Y.
Background: The growing interest in applying artificial intelligence to personalized nutrition is challenged by the complex nature of dietary advice, which must balance health, economic, and personal factors. Although automated solutions using either Linear Programming (LP) or Large Language Models (LLMs) already exist, they have significant drawbacks: LP often lacks personalization, whereas LLMs can be unreliable for precise calculations. Objectives: To develop and assess a model that integrates a Mixed Integer Linear Programming (MILP) solver with an LLM to generate personalized meal plans, and to compare it with standalone LLM and MILP models. Methods: The proposed hybrid MILP+LLM model first uses an LLM (GPT-4o) to filter a unified food dataset (n=297), which combines regional Central Asian and global food items, according to the user's profile. The filtered list of food items is then passed to a MILP solver, which identifies the set of the top 10 optimal solutions. Finally, given this set of solutions, the LLM chooses the most appropriate meal plan. The model was evaluated using five synthesized, clinically complex patient profiles sourced from Adilmetova et al. [4]. The performance of this hybrid model was compared against standalone MILP and LLM models using a 5-point Likert scale, with Kruskal-Wallis and post hoc Dunn's tests for Nutrient Accuracy, Personalization, Practicality, and Variety. Results: Findings demonstrated that the proposed MILP+LLM model reached balanced performance, achieving scores of more than 3.6 points in all criteria, with high scores in Nutrient Accuracy (3.96), Personalization (3.81), and Practicality (3.99). The standalone LLM model performed the weakest in all criteria, with statistically significantly lower scores compared with the other two methods. The standalone MILP model performed best in Nutrient Accuracy (4.93) and Variety (4.10) but lagged behind the MILP+LLM model in Practicality and Personalization. Kruskal-Wallis and Dunn's tests showed that MILP and MILP+LLM outperformed LLM across all criteria. MILP was more accurate (p<0.0001), while the MILP+LLM model was more practical (p=0.021). Conclusions: The findings suggest that integrating the LLM with the MILP solver creates a model that combines qualitative personalization with quantitative precision. This model produces comprehensive, reliable meal plans, addressing the limitations of using either model alone.
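A toy sketch of the kind of MILP subproblem an LLM-filtered food list could be handed to, assuming the PuLP library and a hypothetical four-item food table; the nutrient constraints and costs are invented for illustration and are not the paper's solver setup.

```python
# Minimal sketch (hypothetical data): a MILP that selects integer servings of food
# items meeting calorie and protein floors at minimum cost.
import pulp

foods = {  # name: (kcal, protein_g, cost)
    "lentils": (230, 18, 0.6),
    "rice": (205, 4, 0.3),
    "chicken": (335, 38, 1.8),
    "yogurt": (150, 12, 0.9),
}

prob = pulp.LpProblem("meal_plan", pulp.LpMinimize)
servings = {f: pulp.LpVariable(f, lowBound=0, upBound=3, cat="Integer") for f in foods}

prob += pulp.lpSum(servings[f] * foods[f][2] for f in foods)          # minimize cost
prob += pulp.lpSum(servings[f] * foods[f][0] for f in foods) >= 700   # calorie floor
prob += pulp.lpSum(servings[f] * foods[f][1] for f in foods) >= 40    # protein floor

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({f: int(v.value()) for f, v in servings.items() if v.value()})
```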
Oloko-Oba, M. O.; Aslam, A.; Echols, M.; Onwuanyi, A.; Idris, M. Y.
Heart failure (HF) readmission prediction models often rely on manually curated, cross-sectional features and show limited discrimination and calibration. We evaluated whether automated feature engineering via Deep Feature Synthesis (DFS) improves the clinical applicability of HF readmission prediction from longitudinal electronic health record data. Using 355,217 HF hospitalizations from a large U.S. safety-net health system (2010-2025), we compared a clinician-curated baseline feature set to DFS-enhanced features and trained identical models for 30-, 60-, and 90-day readmission. DFS consistently improved gradient-boosted tree performance, increasing AUROC and AUPRC across all horizons, while logistic regression performance declined. At sensitivity-targeted operating points (80%), DFS improved specificity and positive predictive value for boosted trees, reducing false-positive workload. Calibration also improved for boosted trees at all horizons but not for linear models. These results show that automated feature engineering yields deployment-relevant gains that are strongly model-class dependent. Data and Code Availability: This study uses retrospective electronic health record data from a large urban safety-net healthcare system in the United States. Due to patient privacy, institutional restrictions, and data use agreements, the data are not publicly available. An anonymized version of the code used for data processing, feature engineering, model training, and evaluation will be made available upon acceptance of the paper. Institutional Review Board (IRB): This retrospective study was reviewed and approved by an institutional review board. Full IRB details will be provided in the camera-ready version of the paper if accepted.
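A minimal sketch of Deep Feature Synthesis with the featuretools library, assuming hypothetical patient and encounter tables; the schema and primitives are illustrative, not the study's EHR feature set.

```python
# Minimal sketch (hypothetical tables): automated feature engineering with
# Deep Feature Synthesis, aggregating encounter-level data up to the patient level.
import pandas as pd
import featuretools as ft

patients = pd.DataFrame({"patient_id": [1, 2], "age": [67, 74]})
encounters = pd.DataFrame({
    "encounter_id": [10, 11, 12],
    "patient_id": [1, 1, 2],
    "length_of_stay": [3, 7, 5],
    "admit_time": pd.to_datetime(["2024-01-03", "2024-02-10", "2024-01-20"]),
})

es = ft.EntitySet(id="ehr")
es = es.add_dataframe(dataframe_name="patients", dataframe=patients, index="patient_id")
es = es.add_dataframe(dataframe_name="encounters", dataframe=encounters,
                      index="encounter_id", time_index="admit_time")
es = es.add_relationship("patients", "patient_id", "encounters", "patient_id")

feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="patients",
                                      agg_primitives=["mean", "max", "count"])
print(feature_matrix.columns.tolist())
```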
Yamamoto, Y.; Ueda, K.; Wakimura, H.; Yamada, S.; Watanabe, Y.; Kawano, H.; Ii, S.
This study presents a systematic approach for generating data-driven synthetic cerebral aneurysm geometries and evaluating their hemodynamics through computational fluid dynamics. Seven patient-specific aneurysm geometries from the right internal carotid artery were reconstructed from time-of-flight magnetic resonance angiography images and standardized through orientation alignment, followed by non-rigid registration onto a common spherical point cloud serving as a template. Principal component analysis (PCA) was then applied to the aligned point-cloud data to quantify morphological variability and parameterize shape deformation. The first four principal components captured over 90% of the total variance; however, higher-order components were required to capture the detailed geometric features of the original geometries. Computational fluid dynamics simulations were performed on the PCA-based synthetic geometries under pulsatile flow conditions to investigate the influence of shape variations on intra-aneurysmal flow patterns, time-averaged wall shear stress (TAWSS), and the oscillatory shear index (OSI). The first principal component score (PCS1), which was associated with changes in aneurysm height and dome width, had the strongest effects on TAWSS and OSI levels. Lower PCS1 values, which corresponded to taller and more oblique domes, produced slower adjacent flow and elevated OSI, whereas higher PCS1 values increased TAWSS. The second principal component score primarily modulated lateral geometric asymmetry and further influenced the OSI distribution for lower PCS1 values. Collectively, these findings indicate that PCA-based shape parameterization provides a practical approach for generating synthetic aneurysm datasets and systematically assessing how specific morphological features govern hemodynamic behavior. The proposed approach is expected to contribute to the future development of surrogate modeling and data-driven hemodynamic prediction.
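A minimal sketch of PCA-based shape parameterization and synthesis, assuming scikit-learn and random stand-in point clouds in place of the registered aneurysm geometries.

```python
# Minimal sketch (random stand-in shapes, not the study's registration pipeline):
# PCA over registered point clouds and synthesis of a new geometry from PC scores.
import numpy as np
from sklearn.decomposition import PCA

n_shapes, n_points = 7, 500
rng = np.random.default_rng(0)
# Each registered shape flattened to one row of (x, y, z) coordinates.
shapes = rng.normal(size=(n_shapes, n_points * 3))

pca = PCA(n_components=4)
scores = pca.fit_transform(shapes)
print("cumulative variance explained:", np.round(pca.explained_variance_ratio_.cumsum(), 3))

# Synthesize a new geometry by perturbing the first principal component score.
new_scores = scores.mean(axis=0)
new_scores[0] += 2 * np.sqrt(pca.explained_variance_[0])
synthetic_shape = pca.inverse_transform(new_scores.reshape(1, -1)).reshape(n_points, 3)
print(synthetic_shape.shape)
```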
Dharmavaram, S.; Bhanushali, P.
Overcrowding of emergency departments (EDs) has become a global health care concern as patient volumes rise. Triage systems have been in use for a considerable period, but their reliability in selecting the appropriate patients and level of service has come under much scrutiny. In this paper, we describe a comprehensive machine learning framework aimed at predicting critical emergency department outcomes and enabling dynamic routing decisions. Using the MIMIC-IV-ED database, which comprises more than 440,000 emergency visits, we design and assess varied predictive models, including classical clinical scores, interpretable ML systems, classical algorithms, and deep learning architectures. We investigate three significant outcomes: hospitalization following the ED visit, critical deterioration (ICU transfer or death within 12 hours), and 72-hour ED re-attendance. The results indicate that gradient boosting algorithms make better predictions, with AUROCs of 0.820, 0.881, and 0.699, compared with standard clinical scoring systems and complex deep learning models. The interpretable AutoScore framework combines competitive predictive performance with clinical transparency. We also study patterns of feature importance across prediction tasks and discuss how these models can be implemented in real-time clinical workflows. This study builds a reproducible benchmarking platform for ED prediction research and presents evidence-based recommendations for intelligent patient routing systems that can help enhance emergency care efficiency and resource utilization while improving patient outcomes in a high-pressure environment.
Islam, N.; Luo, C.; Tong, J.; Polleya, D. A.; Jordan, C. T.; Haverkos, B.; Bair, S.; Kent, A.; Weller, G.
Cox proportional hazards regressions are frequently employed to develop prognostic models for time-to-event data, considering both patient-specific and disease-specific characteristics. In high-dimensional clinical modeling, these biological features can exhibit high collinearity due to inter-feature relationships, potentially causing instability and numerical issues during estimation without regularization. For rare diseases such as acute myeloid leukemia (AML), the sparsity and scarcity of data further complicate estimation. In such cases, data augmentation through multi-site collaboration can alleviate these problems. However, this often necessitates sharing individual patient data (IPD) across sites, which presents challenges due to regulatory barriers aimed at protecting patient privacy. To overcome these challenges, we propose a privacy-preserving algorithm that eliminates sharing of IPD across sites and fits a federated penalized piecewise exponential model (FedPPEM) to estimate potential effects of clinical features using summary statistics. This algorithm yields results nearly identical to those from pooled IPD, including effect size and standard error estimates. We demonstrate the model's performance in quantifying the effects of clinical features and genetic risk classification on overall survival using real-world data from approximately 1,200 newly diagnosed AML patients across 33 U.S. sites. Although applied in the AML context, this model is disease-agnostic and can be implemented in other diseases and clinical contexts.
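A minimal single-site sketch of a piecewise exponential survival model, fit as a Poisson GLM on person-interval data with log exposure time as the offset (statsmodels, toy data); the penalization and federated aggregation that define FedPPEM are not reproduced here.

```python
# Minimal sketch (toy data, not FedPPEM): the single-site piecewise exponential
# building block, expanded to person-interval rows and fit as a Poisson GLM.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)                                  # one clinical feature
time = rng.exponential(scale=np.exp(-0.5 * x) * 24, size=n)
event = (time < 36).astype(int)
time = np.minimum(time, 36)                             # administrative censoring at 36 months

# Split follow-up into intervals (0-12, 12-24, 24-36 months).
cuts = [0, 12, 24, 36]
rows = []
for xi, ti, ei in zip(x, time, event):
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        if ti <= lo:
            break
        rows.append({"x": xi, "interval": f"{lo}-{hi}",
                     "exposure": min(ti, hi) - lo,
                     "event": int(ei and ti <= hi)})
pp = pd.DataFrame(rows)

# Interval dummies act as piecewise-constant baseline log-hazards (no intercept).
X = pd.get_dummies(pp[["x", "interval"]], columns=["interval"]).astype(float)
model = sm.GLM(pp["event"], X, family=sm.families.Poisson(),
               offset=np.log(pp["exposure"])).fit()
print(model.params.round(3))
```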
Al-Garadi, M.
IMPORTANCE: Although angiotensin-converting enzyme inhibitors (ACEIs) and angiotensin receptor blockers (ARBs) are recommended for people with chronic kidney disease (CKD), they remain underused. Barriers to adherence, such as adverse effects or patient refusal, are frequently embedded within unstructured clinical narratives and are therefore inaccessible to structured data analytics. Scalable natural language processing (NLP) approaches are needed to identify these barriers and support guideline-concordant care. OBJECTIVE: To develop and evaluate an NLP model capable of identifying documented reasons for ACEI/ARB non-use within clinical notes of people with CKD in the Veterans Affairs (VA) healthcare system. DESIGN, SETTING, AND PARTICIPANTS: This retrospective study analyzed electronic health record data from 2005 to 2024, including people aged 18 to 80 years with CKD, defined by an estimated glomerular filtration rate (eGFR) of 20-60 mL/min/1.73 m2 and presence of albuminuria, across multiple VA medical centers. NLP models were trained on 1,025 manually annotated notes and further augmented with 4,600 synthetic examples generated through schema-guided large language model prompting. MAIN OUTCOMES AND MEASURES: The primary outcome was model performance in identifying notes containing at least one documented reason for ACEI/ARB non-use, evaluated using F1-score, precision, and recall. Secondary outcomes included model learning curve analyses and the effect of synthetic data augmentation on classification performance. RESULTS: The most common documented reasons for ACEI/ARB non-use were acute kidney injury (29.6%), increased creatinine (12.4%), cough (11.2%), and hypotension-related symptoms (11.1%). Across modeling approaches, training with synthetic data augmentation improved detection of notes containing reasons for non-use. Performance gains were statistically significant across all models (McNemar test, P < .05), with the random forest model using Nomic embeddings achieving the highest performance (F1 score, 0.79; 95% CI, 0.68-0.90). CONCLUSIONS AND RELEVANCE: We identified documented reasons for ACEI/ARB non-use (including both failures to initiate therapy and discontinuation after prior use) from unstructured text using an NLP method that does not require massive, expensive computing at inference time. By augmenting training data with schema-guided synthetic notes, we achieved robust, privacy-preserving performance within an NLP framework. This approach may support scalable clinical decision support systems to promote guideline-concordant prescribing.
Mittelberg, Y.; Stiglitz, D. K.; Kowadlo, G.
Background: Personalized medicine promises to tailor treatments to the individual, but it carries a hidden risk: mistaking statistical noise for actionable clinical insight. Current machine learning approaches often provide predictions but fail to inform clinicians when those predictions are unreliable. Objective: To develop a deployment-readiness framework that integrates causal inference, interpretable effect-trees, and calibration assessment to distinguish actionable signal from unreliable variation, and to support treatment selection only when the estimated benefit is both reliable and clinically meaningful. Methods: Using retrospective observational cohort EHR data from the INSPIRE perioperative dataset (N>130,000 surgical operations, 2011-2020), we estimated treatment effects using causal forests with double machine learning, benchmarked against other causal methods to assess convergence. We used the estimated causal effects to create effect-trees and translated estimates into interpretable rules. We validated the treatment recommendations by assessing subgroup calibration to identify which groups were reliable for treatment selection. Results: In a prostate procedures case study (neuraxial versus general anesthesia; total N=2,822), neuraxial anesthesia was associated with substantially lower post-operative opioid use (ATE = -1.38 opioid medications, 95% CI [-1.62, -1.15]). The effect-tree produced five clinically interpretable subgroups using BMI, ASA status, and age, with effects ranging from -1.10 to -1.59 opioid medications. Calibration analysis identified four of five subgroups as reliable for deployment (calibration error < 0.08), while one small subgroup (N=250) showed higher calibration error (0.44), illustrating how the framework flags unreliable heterogeneity. Conclusions: Individual prediction heterogeneity does not automatically justify clinical personalization. By combining effect-trees with calibration, this framework distinguishes actionable heterogeneity from noisy heterogeneity (detectable but unreliable). This approach transforms causal machine learning from a black box into a validated decision support system that enables selective deployment of treatment decision rules.
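A minimal sketch of the causal-forest-plus-effect-tree workflow, assuming the econml library and synthetic data; the variable names and data-generating process are illustrative, not the INSPIRE analysis.

```python
# Minimal sketch (synthetic data): heterogeneous treatment effects via a causal
# forest with double machine learning, summarized by a shallow interpretable tree.
import numpy as np
from econml.dml import CausalForestDML
from econml.cate_interpreter import SingleTreeCateInterpreter

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 3))                    # stand-ins for BMI, ASA status, age (standardized)
T = rng.binomial(1, 0.5, size=n)               # treatment: neuraxial vs general anesthesia
Y = -1.0 * T - 0.4 * T * (X[:, 0] > 0) + X[:, 1] + rng.normal(size=n)  # heterogeneous effect

est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X)
print("ATE:", round(float(est.ate(X)), 2))

# Effect tree: a shallow tree over covariates that groups similar estimated effects.
interp = SingleTreeCateInterpreter(max_depth=2)
interp.interpret(est, X)
interp.plot(feature_names=["bmi", "asa", "age"])
```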
Arethiya, N. J.; Krammer, L.; David, J.; Bakshi, V.; BasuChoudhary, A.; Bhuiyan, U.; Sen, S.; Mazumder, R.; McNeely, P.
As of early 2026, over 115 million US adults (more than 1 in 3) have prediabetes, a condition with an annual conversion rate of 5%-10% to type 2 diabetes. Total diabetes (diagnosed and undiagnosed) affects approximately 40.1 million Americans, or 12% of the population, with roughly 1.5 million new cases diagnosed annually. Continuous Glucose Monitoring (CGM) provides real-time, 24/7 insight into glycemic variability, detecting dangerous highs, lows, and trends that HbA1c (a 3-month average) misses. It enables, for instance, identification of nocturnal hypoglycemia or postprandial spikes, enhancing personalized, actionable treatment decisions and improving safety. The Artificial Intelligence Ready and Exploratory Atlas for Diabetes Insights (AI-READI) dataset was produced by the National Institutes of Health (NIH) Common Fund Data Ecosystem (CFDE) Bridge2AI program. This dataset offers a rich resource for diabetes research, providing comprehensive biosensor data from over 1,067 participants. However, like many medical datasets, AI-READI contains label inaccuracies due to self-reported health surveys and static HbA1c indicators, which can undermine model effectiveness. We developed a robust classification framework using a Convolutional-Bidirectional Long Short-Term Memory (Conv+BiLSTM) network to analyze and accurately classify glycemic health states from continuous glucose monitoring time-series data. Our aim was to identify and correct misclassified labels through hybrid unsupervised-supervised learning methods, and we validated our results with expert-in-the-loop clinical review. We analyzed 784 participants from the AI-READI dataset, representing four health states: healthy, prediabetes lifestyle-controlled, oral medication, and insulin-dependent. Based on recommendations from the literature and our own expertise, we compared the self-provided "healthy" group labels with a cluster-agnostic, CGM-defined healthy (CGM-H) reference derived from CGM metrics. K-means clustering (K=6) on standardized CGM summary features was used to identify CGM-H participants, followed by XGBoost-based iterative label refinement. We identified a misclassification rate of 56.9% (161/283) in the initially labeled "healthy" group. After eight iterations of XGBoost refinement with dual-criterion relabeling (≥80% probability plus unanimous out-of-fold voting), the cleaned dataset increased CGM-H participants from 122 to 195 for binary classification. Next, we developed a Conv+BiLSTM model combining convolutional layers (32 and 64 filters) for local temporal feature extraction with bidirectional LSTM layers (64 and 32 units) for sequence modeling, using engineered time-series features including rolling statistics, glucose derivatives, and circadian rhythm encoding. Class imbalance was addressed with per-class weighting, and 5-fold stratified cross-validation estimated generalization performance, with a global decision threshold (0.374) computed by maximizing Youden's J statistic on concatenated out-of-fold predictions. Additionally, we analyzed heart rate, activity level, stress, and sleep data and validated them against CGM data. The Conv+BiLSTM model achieved a ROC-AUC of approximately 0.932 on the held-out test set and 0.907 ± 0.026 in cross-validation, with well-calibrated predictions (Expected Calibration Error = 0.075; temperature scaling T = 1.00). A 3-tier confidence-based decision system achieved an 82% detection rate with only a 6% OGTT burden, enabling actionable clinical recommendations. This hybrid approach addressed label noise while achieving high discrimination, and the framework demonstrates potential for real-time glycemic state monitoring and early intervention in diabetes progression.
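A minimal sketch of a Conv+BiLSTM classifier over CGM windows, assuming TensorFlow/Keras and random stand-in data; only the quoted layer sizes are taken from the abstract, and everything else (window length, training setup) is an assumption.

```python
# Minimal sketch (random stand-in data, not the study's trained model): a Conv+BiLSTM
# binary classifier over CGM time-series windows.
import numpy as np
from tensorflow.keras import layers, models

# Stand-in data: 288 five-minute glucose readings per day, single channel.
X = np.random.rand(32, 288, 1).astype("float32")
y = np.random.randint(0, 2, size=32)

model = models.Sequential([
    layers.Input(shape=(288, 1)),
    layers.Conv1D(32, kernel_size=5, activation="relu", padding="same"),
    layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
    layers.MaxPooling1D(2),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
print(model.predict(X[:2], verbose=0).round(3))
```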